Abstract

This paper proposes a new approach to object searching in video databases, SoftCBIR, which
combines a keypoint matching algorithm and a graduated assignment algorithm based on softassign.
Compared with previous approaches, SoftCBIR is an innovative combination of two powerful
techniques: 1) An energy minimization algorithm is applied to match two groups of keypoints
while accounting for both their similarity in descriptor space and the consistency of their
geometric configuration. The algorithm computes correspondence and pose transformation between
two groups of keypoints iteratively and alternately toward an optimal result. The objective energy
function combines normalized distance errors in descriptor space and in the spatial domain. 2)
Initial individual keypoint matching relies on Approximate K-Nearest Neighbor (ANN) search.
ANN achieves much more accurate initial keypoint matching results in the descriptor space than
K-means labeling. Experiments confirm the effectiveness of our approach and demonstrate the
performance improvements arising from the combination of the two proposed techniques in the
SoftCBIR algorithm.
1  Introduction


As digital video becomes more prevalent in people's lives, content-based video analysis has
grown into a very active research area. Object searching is a relatively new topic [17, 13, 23, 24,
11, 12, 14, 3, 18, 20] in content-based video analysis, and consists of searching for desired objects
in videos by query words or examples, according to three approaches: 1) textual only, 2) visual
only, 3) textual and visual. In this paper, we focus on the visual approach only.

Almost  all  researchers  agree  that  an  object  can  be  represented  by  a  constellation  of  keypoints 
with  a  specific  geometric  configuration  [17,  13,  23,  24,  3,  18,  20].  The  mainstream  approach  of 
object  searching  [17,  13,  23,  24,  20]  is  composed  of  three  steps:  detecting  keypoints,  computing 
descriptors,  and  matching.  In  the  first  step,  keypoints  are  detected  and  located  with  specific 
position,  scale  and  orientation.  The  second  step  computes  descriptors  at  keypoint  locations. 
The  third  step  matches  keypoints  by  their  descriptors.  The  matching  between  two  keypoints  is 
evaluated  by  the  distance  between  their  descriptors  in  the  descriptor  vector  space.  An  object  is 
represented  by  a  local  group  of  keypoints.  To  match  two  objects,  their  groups  of  keypoints  are 
matched  based  on  certain  criteria,  including  not  only  the  matching  between  individual  keypoints, 
but  also  some  measures  based  on  comparisons  of  the  geometric  configurations  between  the  two 
groups.  For  the  above  three  steps,  researchers  have  some  agreement  about  the  first  step  and  the 
third step. For the first step, typically, keypoints are detected at extrema positions in a
scale-space. For the third step, typically, people enforce either neighborhood consistency or geometric
consistency  between  the  two  objects.  (Authors  have  used  the  term  spatial  consistency  to  express 
the  concept  of  neighborhood  consistency,  but  we  avoid  it  as  this  seems  confusing  when  examined 
at  the  same  time  as  geometric  consistency.)  Neighborhood  consistency  enforces  the  fact  that  a 
group  of  neighbor  keypoints  in  one  object  should  be  mapped  into  a  group  of  neighbor  keypoints 
in  the  other  object  [23].  Geometric  consistency  assumes  there  exists  a  planar  affine  transform 
or  homography  between  the  points  of  the  two  objects.  Geometric  consistency  is  stronger  than 
neighborhood  consistency,  but  more  complicated  to  compute.  On  the  other  hand,  for  step  2,  the 
reported methods are very diverse. In 2004, Mikolajczyk and Schmid [19] presented a comparative
study of the available local descriptors: Harris-Affine detector [8], shape context [2], steerable
filters [4], differential invariants [15], spin images [16], complex filters [22], moment invariants
[7], SIFT [17], PCA-SIFT [13], and cross-correlation of different types of keypoints. That work
concludes that SIFT performs best in the comparison experiments. However, the performance
of PCA-SIFT is not far behind. In addition, the name PCA-SIFT is somewhat of a misnomer, in the
sense that the method does not use the histograms of gradient directions in subregions around
interest points that define the SIFT approach. Instead, the feature vector is obtained by PCA
analysis of the raw vectors of gradient values around interest points. Furthermore, the PCA step
results in much smaller feature vectors (typically 20 dimensions instead of 128 dimensions).

Sivic and Zisserman proposed several video mining approaches, including Video Google [23] in
2003 and a related approach [24] in 2004. They use Lowe's keypoint detector and SIFT descriptor
[17], and had some success in extracting objects in videos. But there are three shortcomings in
their work. 1) In [23] and [24], for matching between groups of keypoints, the authors argue that
computing the parameters of the planar homography between the two groups using iterations is
time-consuming, and they adopt a simpler testing criterion: counting the matched
individual keypoint pairs between the two groups after neighborhood consistency filtering. If
this number is above a certain threshold, the two groups are considered matched. Geometric
consistency between groups is not considered. This simplification will produce false object matches
under certain circumstances. 2) In [23] and [24], the authors use K-means clustering to perform
a vector quantization on the set of keypoint descriptors extracted from the keyframes of videos
in the database. The number of clusters K is set manually. K-means clusters are given unique
labels, and each keypoint is given the label of the cluster it belongs to. In the matching, keypoints
with identical labels are matched. There are two problems in this method. First, the K-means
algorithm is not very suitable for vector quantization in large and high-dimensional data sets, due
to its two inherent drawbacks: dependency on the initial state and degeneracy [25]. The initial
state includes the selection of initial centers and the value of K. K-means is non-deterministic,
thus final results depend upon the initial state, because the mean squared error often converges
to a local minimum [10]. But the local minimum cannot guarantee an ideal Voronoi graph-like
vector quantization result. Instead, the K clusters in the K-means result may be elongated in the
high-dimensional space. Additionally, the value of K is also critical to the clustering result, and
obtaining an optimal K for a given data set is an NP-hard problem [25]. When the distribution
of the data set is unknown, the optimal K is hard to attain. A manually assigned value of K
may not be suitable for the data, falsely splitting one good cluster into two, or missing some
obvious splits, or both. The risk of degeneracy implies that the clustering may end with some
meaningless sparse clusters. Second, K-means is not very suitable for object searching. There
may be a lot of keypoints lying around the borders between clusters, because K-means does not
account for point density. Thus, it does not hold that two keypoints assigned the same label
are among the closest neighbors in the descriptor vector space. This fact will cause errors in a
matching algorithm based on K-means results. 3) They use "raw" SIFT descriptors, which are
128-dimensional vectors. Considering that there are hundreds or thousands of keypoints in a single
ordinary video frame, the distance computation between SIFT descriptors in the 128-dimensional
vector space for thousands of frames is overwhelming. Such a high-dimensional descriptor is
impractical for a usable video analysis system, which requires near real-time response to users'
interactions.
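
The border effect described in shortcoming 2) can be illustrated with a toy one-dimensional
example. The sketch below is only an illustration under assumed values: the two cluster centers
and the three scalar "descriptors" are hypothetical, and real descriptors are high-dimensional.
It shows that two points sharing a label can be far apart while two points with different labels
are close.

```python
def nearest_center(x, centers):
    """Return the index of the center closest to the scalar x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

# Hypothetical 1-D cluster centers and descriptors (not taken from the paper).
centers = [0.0, 10.0]
a, b, c = 0.5, 4.9, 5.1

label = {name: nearest_center(value, centers)
         for name, value in (("a", a), ("b", b), ("c", c))}

# a and b share a label although they are far apart;
# b and c are close although they sit on opposite sides of the cluster border.
same_label_distance = abs(a - b)
cross_label_distance = abs(b - c)
print(label, same_label_distance, cross_label_distance)
```

A nearest neighbor query around b would return c first, whereas label matching pairs b with a.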

In this paper, we describe a new approach to object searching in videos called SoftCBIR that
addresses the above three problems. It combines two innovative components. (1) An energy
minimization framework is proposed to update the pose transformation and correspondence
alternately and iteratively. We accomplish this by extending the ideas of softassign [6]. Softassign
is a probabilistic framework that lets all the available data participate in the updates of the pose
transformation and correspondence, so it is less vulnerable to the selection of the initial state than
the incremental method of [20]. The energy minimization combines the use of keypoint matching
and softassign to evaluate the matching between two groups of keypoints, by evaluating the
similarity in keypoint descriptor vector space and the geometric consistency between two groups of
keypoints simultaneously. The iteration drives both correspondence and pose transformation
toward an optimal result. (2) We use Approximate K-Nearest Neighbor (ANN) search [1] to perform
the initial individual keypoint matching in the descriptor vector space instead of the K-means used
by [23] and [24]. ANN searching is much more accurate than searching by K-means labeling. Although
naive nearest neighbor search would be very time-consuming, ANN uses a balanced box-decomposition
(BBD) tree, which is efficient to build and search. It also provides a parameter to adapt the
tradeoff between accuracy and efficiency. Evaluation shows that it can achieve both fairly high
accuracy and efficiency [1]. The SoftCBIR algorithm is the novel combination of these two
components. In addition, since it uses PCA-SIFT, it handles feature vectors with more manageable
numbers of dimensions.

The  rest  of  the  paper  is  organized  as  follows:  Section  2  gives  an  overview  of  our  system.  Sections 
3  and  4  present  details  about  the  energy  minimization  algorithm  and  ANN.  Section  5  shows  results 
of  experiments  and  evaluations.  Conclusions  and  future  work  are  addressed  in  Section  6. 

2  Overview 

In  a  preprocessing  stage,  one  keyframe  is  extracted  for  each  shot  in  the  video  database,  and 
keypoints (including their x-position, y-position, scale and orientation) and their descriptors are
extracted  using  the  PCA-SIFT  approach  for  each  keyframe.  Then  a  BBD  tree  is  built  using 
the  whole  set  of  descriptors  for  all  the  videos.  Below  is  an  outline  of  the  process  for  our  object 
searching  approach  once  the  user  opens  a  query  image  and  draws  a  rectangle  to  define  a  query 
object. 

1.  Extract  the  keypoints  and  their  descriptors  in  the  user’s  query  rectangle. 

2.  For  each  keypoint  in  the  query  rectangle,  search  its  K  nearest  neighbor  keypoints  as  matched 
keypoints  with  ANN  using  the  BBD  tree  built  in  the  preprocessing  stage.  In  our  experiments, 
K  is  set  to  14  in  ANN. 

3.  Perform  the  energy  minimization  algorithm  for  each  keyframe  containing  a  subset  of  the 
matched  keypoints  obtained  by  ANN. 

4.  For  each  keyframe,  define  a  matching  score  equal  to  the  number  of  final  matched  keypoints 
after  the  energy  minimization  algorithm,  and  determine  a  rectangle  framing  all  matched 
keypoints  as  the  matched  object  area. 

5.  Rank  the  keyframes  in  the  database  according  to  their  matching  scores. 

6.  Submit  ranked  results:  the  user  can  choose  to  view  top-ranked  keyframes. 
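
Steps 3 to 5 above can be sketched as follows. This is a minimal illustration, not the authors'
implementation: the function names, the tuple layout of the ANN candidates, and the toy `verify`
callback standing in for the energy minimization stage are all assumptions.

```python
from collections import defaultdict

def rank_keyframes(candidate_matches, verify):
    """Group ANN candidates by keyframe, verify each group, rank by surviving matches.

    candidate_matches: iterable of (query_kp_id, keyframe_id, frame_kp_id) tuples.
    verify: callable taking a list of (query_kp_id, frame_kp_id) pairs and returning
            the pairs that survive (a stand-in for the energy minimization stage).
    """
    by_frame = defaultdict(list)
    for q_id, frame_id, kp_id in candidate_matches:
        by_frame[frame_id].append((q_id, kp_id))
    scores = {frame_id: len(verify(pairs)) for frame_id, pairs in by_frame.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def keep_first_per_query(pairs):
    """Toy verifier (hypothetical): keep one candidate per query keypoint."""
    seen, kept = set(), []
    for q_id, kp_id in pairs:
        if q_id not in seen:
            seen.add(q_id)
            kept.append((q_id, kp_id))
    return kept

matches = [(0, "f1", 10), (0, "f1", 11), (1, "f1", 12), (0, "f2", 20), (2, "f1", 13)]
ranking = rank_keyframes(matches, keep_first_per_query)
print(ranking)
```

Only keyframes that received at least one ANN candidate are scored, which mirrors step 3.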

3  Energy  minimization 

Graduated assignment is an iterative registration algorithm that has been used in particular to
register two sets of 2D points in medical images [6]. For two groups of points in 2D or 3D space, it
computes the correspondence and pose transformation iteratively and simultaneously, using a fast
deterministic annealing mechanism. The objective is to minimize an energy function, which is a
sum of distance errors between pairs of matched keypoints in the spatial domain. In our object
searching framework, after achieving an initial individual matching of keypoints, it is necessary to
check the geometric consistency between the two groups of keypoints for the following two
reasons: 1) Some individual matchings of keypoints are wrong. 2) Correct individual matching
of keypoints cannot guarantee that the two groups of keypoints match unless they also exhibit
similar geometric configurations. This geometric consistency problem is more complex than the
purely spatial matching problem that the original softassign algorithm was meant to solve. Indeed,
we should consider not only the distance error in the spatial domain, but also the distance error in
the descriptor vector space of keypoints. In the energy minimization of our SoftCBIR algorithm,
we combine these two distances into a new energy function and apply deterministic annealing
iterations similar to those of softassign to find video frames containing groups of keypoints that
match the keypoints of query rectangles with similar individual descriptors and similar geometric
configurations.

Consider a query image and a test video frame. The query image has a list of J keypoints
$p_j^{(1)} = (x_j^{(1)}, y_j^{(1)})$ with descriptors $d_j^{(1)}$ in the descriptor space. The test
video frame has a list of K keypoints $p_k^{(2)} = (x_k^{(2)}, y_k^{(2)})$ with descriptors
$d_k^{(2)}$ in the descriptor space.

We use a correspondence matrix M to indicate the correspondence between the J keypoints in
the query image and the K keypoints in the test video frame. $M_{jk}$ indicates the probability of
correspondence between $p_j^{(1)}$ and $p_k^{(2)}$. The matrix M is a $(J + 1) \times (K + 1)$
matrix because an additional slack row and an additional slack column are needed to indicate when
no acceptable correspondence has been found for a keypoint in the query image or in the test frame
respectively. At each step of iteration, the Sinkhorn algorithm, an iterative normalization of rows
and columns, is applied to M to ensure that each row and each column sum up to one.
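
The alternating normalization can be sketched in a few lines. This is a minimal sketch under
assumptions: the function name `sinkhorn`, the stopping parameters `max_iters` and `tol`, and the
example matrix are illustrative and not taken from the paper. The last row and the last column
are treated as slack, so the slack row is skipped when normalizing rows and the slack column is
skipped when normalizing columns.

```python
def sinkhorn(M, max_iters=100, tol=1e-6):
    """Alternately normalize rows and columns of M (a list of lists), in place.

    The last row and last column of M are slack entries.
    """
    n_rows, n_cols = len(M), len(M[0])
    for _ in range(max_iters):
        previous = [row[:] for row in M]
        for j in range(n_rows - 1):              # every row except the slack row
            row_sum = sum(M[j])
            for k in range(n_cols):
                M[j][k] /= row_sum
        for k in range(n_cols - 1):              # every column except the slack column
            col_sum = sum(M[j][k] for j in range(n_rows))
            for j in range(n_rows):
                M[j][k] /= col_sum
        change = max(abs(M[j][k] - previous[j][k])
                     for j in range(n_rows) for k in range(n_cols))
        if change < tol:                         # "until change in M is small"
            break
    return M

# J = 2 query keypoints, K = 2 frame keypoints, plus slack row and column.
M = sinkhorn([[0.90, 0.10, 0.05],
              [0.20, 0.80, 0.05],
              [0.05, 0.05, 0.05]])
```

The bottom-right slack entry belongs to neither normalization and is left unchanged.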

We  look  for  an  affine  transform  A  from  the  group  of  keypoints  in  the  query  image  to  the  group 
of  keypoints  in  the  test  frame,  defined  as 


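$$A = \begin{pmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ 0 & 0 & 1 \end{pmatrix} \qquad (1)$$

Thus the point $A\,p_j^{(1)}$ (with $p_j^{(1)}$ written in homogeneous coordinates) is the
corresponding spatial position for $p_j^{(1)}$ after the affine transform.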
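The objective energy function E is then defined as

$$E = \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk} \left[ \frac{D_s^2\big(A\,p_j^{(1)}, p_k^{(2)}\big)}{\alpha_s} + \frac{D_d^2\big(d_j^{(1)}, d_k^{(2)}\big)}{\alpha_d} \right] \qquad (2)$$

where $D_s(\cdot,\cdot)$ is the Euclidean distance function in the spatial domain and
$D_d(\cdot,\cdot)$ is the Euclidean distance function in the descriptor vector space, and
$\alpha_s$ and $\alpha_d$ are normalization parameters for the two distances.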
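
As a small numerical illustration of this energy, the sketch below evaluates the combined squared
distance for every pair of keypoints and sums it weighted by M. It is only a sketch under
assumptions: the helper names, the 2x3 representation of A (the constant last row is omitted), and
the toy points and one-dimensional descriptors are all hypothetical.

```python
def apply_affine(A, p):
    """Apply A = [[a1, a2, a3], [b1, b2, b3]] to the point p = (x, y)."""
    x, y = p
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def combined_sq_distances(A, pts1, desc1, pts2, desc2, alpha_s, alpha_d):
    """D[j][k] = spatial term / alpha_s + descriptor term / alpha_d."""
    return [[sq_dist(apply_affine(A, p1), p2) / alpha_s + sq_dist(d1, d2) / alpha_d
             for p2, d2 in zip(pts2, desc2)]
            for p1, d1 in zip(pts1, desc1)]

def energy(M, D):
    """Sum of M[j][k] * D[j][k] over non-slack entries (M may carry slack row/column)."""
    return sum(M[j][k] * D[j][k] for j in range(len(D)) for k in range(len(D[0])))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pts = [(0.0, 0.0), (1.0, 0.0)]
desc = [(0.0,), (1.0,)]
D = combined_sq_distances(identity, pts, desc, pts, desc, alpha_s=1.0, alpha_d=1.0)

M_correct = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]   # each keypoint matched to itself
M_swapped = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # keypoints matched crosswise
print(energy(M_correct, D), energy(M_swapped, D))
```

The correct correspondence has zero energy, while the swapped one is penalized in both the spatial
and the descriptor term.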

The energy minimization used by SoftCBIR is a deterministic annealing procedure in which
a temperature $T = 1/\beta$ is initially high, so that small local minima are first smoothed out at
the expense of accuracy, and is progressively decreased so that, once a large minimum is found,
landscape details are provided for increased accuracy. At each temperature level, the first step
is to update M using the current squared spatial distance errors $D_s^2(A\,p_j^{(1)}, p_k^{(2)})$
and squared descriptor distances $D_d^2(d_j^{(1)}, d_k^{(2)})$. The second step is to update the
affine transformation A by solving in closed form for the affine transform parameters that set the
derivative $\partial E/\partial A = 0$, thereby minimizing the energy. The code of this component
of our SoftCBIR algorithm can be summarized as follows:

Inputs:

1. For the query image, a list of J keypoints $p_j^{(1)} = (x_j^{(1)}, y_j^{(1)})$ with descriptors
$d_j^{(1)}$ in the descriptor space.

2. For the test video frame, a list of K keypoints $p_k^{(2)} = (x_k^{(2)}, y_k^{(2)})$ with
descriptors $d_k^{(2)}$ in the descriptor space.

Initialize slack elements of assignment matrix M to $\gamma = 1/(\max\{J, K\} + 1)$, $\beta$ to $\beta_0$.
Initialize A to an expected pose transformation (see below for details).

Do A until $\beta > \beta_{final}$ or $E < E_{threshold}$ (deterministic annealing loop)

- Compute combined squared distances
  $D_{jk}^2 = D_s^2(A\,p_j^{(1)}, p_k^{(2)})/\alpha_s + D_d^2(d_j^{(1)}, d_k^{(2)})/\alpha_d$

- Update $M_{jk} = \gamma\, e^{-\beta D_{jk}^2}$

- Do B until $\Delta M$ small (Sinkhorn algorithm)

    Update matrix M by normalizing each row except slack row:
    $M_{jk} = M_{jk} / \sum_{k'=1}^{K+1} M_{jk'}$

    Update matrix M by normalizing each column except slack column:
    $M_{jk} = M_{jk} / \sum_{j'=1}^{J+1} M_{j'k}$

- End Do B

- Compute energy $E = \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk} D_{jk}^2$

- Update A by minimizing the energy E, i.e., solve the equation $\partial E/\partial A = 0$ with the
  parameters of A as variables (see below for details)

- $\beta \leftarrow \beta_{update}\, \beta$

End Do A

Outputs: Affine matrix A and correspondence matrix M between the groups of keypoints in
query image and test video frame.


In other words, the energy minimization in SoftCBIR produces a high probability of
correspondence between pairs of keypoints that are close in descriptor space and also in the spatial
domain after affine transformation, thanks to the Gaussian relationship between distance and
correspondence of the update step $M_{jk} = \gamma\, e^{-\beta D_{jk}^2}$. Note the multiplying
term $\beta$, which is the inverse of the temperature. In the first iterations, $\beta$ is small,
so that the algorithm does not pay much attention to distances, and probabilities tend to be
equally distributed between several possible pairings between keypoints. It is only after several
iterations, when the affine transformation becomes more correctly calculated, that the parameter
$\beta$ becomes large enough and the algorithm starts to fully account for the distances between
points, which in turn improves the affine transform calculation. In our experiments, we use
$\beta_0 = 1.2$, $\beta_{update} = 1.05$, and $\beta_{final} = 10$. Loop B within
loop A enforces normalization to one for the probabilities of each row and each column. The
combined result of normalization and exponentiation of distances is a winner-take-all mechanism
in which one term in each row and each column tends to grow closer to one at the expense of the
other terms. If the combined distances between a transformed keypoint and all the keypoints of
its row remain large, then the winner becomes the extra term of the slack column; therefore a
large value in the slack column indicates that no match was found for a keypoint of the query
image. Similarly, elements of the slack row are large when keypoints of a video frame do not
correspond to any keypoints of the query image.
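
The effect of the inverse temperature can be checked numerically. The sketch below is an
illustration under assumptions: the three combined squared distances are hypothetical, slack
entries are ignored, and only a single row normalization is applied instead of the full Sinkhorn
loop. It also counts how many passes of loop A the reported schedule would take if the loop ends
only on the $\beta$ test, assuming $\beta$ is multiplied by $\beta_{update}$ at each pass.

```python
import math

def soft_row(distances, beta):
    """Row-normalized exp(-beta * D) for one query keypoint (slack omitted)."""
    weights = [math.exp(-beta * d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

def annealing_steps(beta0, beta_update, beta_final):
    """Number of passes of loop A when beta grows geometrically past beta_final."""
    steps, beta = 0, beta0
    while beta <= beta_final:
        steps += 1
        beta *= beta_update
    return steps

d = [0.5, 1.0, 2.0]            # hypothetical combined squared distances
early = soft_row(d, 1.2)       # first iteration: probabilities are spread out
late = soft_row(d, 10.0)       # last iterations: close to winner-take-all
steps = annealing_steps(1.2, 1.05, 10.0)
print(early, late, steps)
```

With these values the best candidate holds under 60% of the probability at the start and over 99%
at the end, and the schedule amounts to 44 passes unless the energy threshold stops it earlier.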

At  each  step,  the  parameters  of  the  affine  transformation  A  that  minimize  the  energy  can  be 
found  in  closed  form  by  solving  the  two  3x3  systems: 
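
As a sketch of how such systems arise (derived here from the energy of Eq. (2) under the
assumption that only the spatial term depends on A; the arrangement of the authors' own equations
may differ), write $q_j = (x_j^{(1)}, y_j^{(1)}, 1)^T$, $a = (a_1, a_2, a_3)^T$ and
$b = (b_1, b_2, b_3)^T$. Setting the derivatives of E with respect to a and b to zero gives two
weighted least-squares normal equations that share the same 3x3 matrix; the factor
$1/\alpha_s$ cancels on both sides.

```latex
\[
\Big( \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk}\, q_j q_j^{T} \Big)\, a
  = \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk}\, x_k^{(2)}\, q_j ,
\qquad
\Big( \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk}\, q_j q_j^{T} \Big)\, b
  = \sum_{j=1}^{J} \sum_{k=1}^{K} M_{jk}\, y_k^{(2)}\, q_j .
\]
```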
After the energy minimization stage, we transform the correspondence matrix M, which
expresses probabilities of correspondences between keypoints of the query image and the test video
frame, into a hard assignment matrix with only zeros and ones: if $M_{jk}$ is the unique maximum in
both the jth row and the kth column and $M_{jk}$ is larger than a threshold $\theta = 0.9$, then
$M_{jk}$ is set to one, and if this term is not in a slack row or column, the keypoints
$p_j^{(1)}$ and $p_k^{(2)}$ are considered to be matched keypoints.
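
The thresholding rule can be sketched as below. This is a minimal sketch, not the authors' code:
the function name and the example matrix are assumptions, and the last row and column of M are
taken to be the slack entries.

```python
def hard_assignment(M, theta=0.9):
    """Return matched (j, k) index pairs from a soft correspondence matrix M.

    A non-slack entry is kept when it exceeds theta and is the unique maximum
    of both its row and its column (slack entries included in the comparison).
    """
    n_rows, n_cols = len(M), len(M[0])
    matches = []
    for j in range(n_rows - 1):
        for k in range(n_cols - 1):
            value = M[j][k]
            if value <= theta:
                continue
            row_ok = all(value > M[j][c] for c in range(n_cols) if c != k)
            col_ok = all(value > M[r][k] for r in range(n_rows) if r != j)
            if row_ok and col_ok:
                matches.append((j, k))
    return matches

M_soft = [[0.95, 0.02, 0.03],
          [0.05, 0.40, 0.55],   # slack column wins: query keypoint 1 is unmatched
          [0.00, 0.58, 0.05]]   # slack row
matched_pairs = hard_assignment(M_soft)
score = len(matched_pairs)       # matching score of this frame
print(matched_pairs, score)
```

Because a kept entry must be the unique maximum of its row and column, each keypoint appears in at
most one matched pair.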

At  this  stage  we  adopt  the  number  of  surviving  matched  keypoints  as  the  matching  score.  The 
frames  are  ranked  by  their  matching  score  and  the  ranked  list  is  delivered  as  the  object  searching 
result. 

The  deterministic  annealing  algorithm  is  slow  if  the  affine  matrix  A  is  initialized  with  random 
values.  To  speed  up  the  convergence  of  the  algorithm,  we  estimate  an  initial  affine  matrix  A  using 
the  initial  matched  keypoints  provided  by  ANN.  The  routine  for  producing  this  initial  value  for 
A is similar to one run of the loop A in the pseudocode, in which M is computed using only
distances in the descriptor space, i.e. as
$M_{jk} = \gamma\, e^{-\beta D_d^2(d_j^{(1)}, d_k^{(2)})/\alpha_d}$.

In  Section  2,  we  explained  that  the  K  used  for  ANN  is  14.  For  each  keypoint  inside  the  query 
rectangle  of  the  query  image,  we  find  its  14  nearest  neighbors  in  the  keypoint  descriptor  space 
among  all  the  keypoints  in  all  the  keyframes  in  the  database.  So  in  each  test  frame,  for  each 
query  keypoint,  we  may  find  several  matched  keypoints.  In  other  words,  in  the  initialization 
stage of the SoftCBIR algorithm, there can be one-to-many or many-to-one matches. But after the
energy minimization, the thresholding of the correspondence matrix into a hard assignment matrix
guarantees one-to-one matches. Fig. 1 illustrates this mechanism.


Figure 1: A diagram example of matched keypoints before energy minimization (a) and after it
(b). In (a), identical numbers indicate nearest neighbors found by ANN. In (b), identical
numbers indicate matched pairs.


4  ANN

ANN search is used to generate candidate matches as inputs to the energy minimization algorithm.
As mentioned, K-means has inherent drawbacks for keypoint matching. Nearest neighbor (NN)
search is better suited for this task. Given a set S of n data points in a d-dimensional feature
space F, the goal of NN searching is to find the data points in S closest to a given query point q
in F. But the brute-force search requires O(dn) time, which is unacceptable in many practical
problems. A K-d tree can be used to address the NN problem efficiently [5] and find K nearest
neighbors in logarithmic time. But in fact, it is logarithmic only for fairly restricted input
distributions in low-dimensional space. For certain inputs or in high-dimensional space, the K-d
tree search can be linear in time. In light of these difficulties, an alternative approach is to
find the approximate nearest neighbors (ANN) [1], which is formulated as: given a set S of n data
points and a query point q in a d-dimensional feature space F, find a p in S such that p is a
$(1+\varepsilon)$-approximate nearest neighbor of q, that is,


$$\mathrm{dist}(p, q) \le (1 + \varepsilon)\, \mathrm{dist}(p^*, q), \qquad (5)$$

where $p^*$ is the true nearest neighbor to q. Arya et al. [1] show that in $O(dn \log n)$ time it
is possible to construct a data structure of size $O(dn)$ such that an ANN p can be reported
in $O(c_{d,\varepsilon} \log n)$ time for a constant
$c_{d,\varepsilon} \le d \lceil 1 + 6d/\varepsilon \rceil^d$ and any Minkowski metric. The data
structure used in [1] is a balanced box-decomposition (BBD) tree that hierarchically decomposes
the n data points into a collection of cells, each of which is either a d-dimensional rectangle or a
set-theoretic difference of two rectangles, one enclosed within the other. In our system, we build
a BBD tree to store the whole set of keypoints obtained from all the frames of the videos. In
ANN, we use $\varepsilon = 0.1$ to increase speed without noticeably reducing accuracy. The K in
ANN is an adjustable parameter of our system, which users can tune to suit the database
they have. Generally, the more similar images in the database, the larger K should be.
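
The approximation criterion of Eq. (5) can be checked by brute force on a small point set. The
sketch below is only an illustration: the function name and the toy points are assumptions, the
Euclidean metric is used, and a real system would query the BBD tree rather than scan all points.

```python
import math

def is_eps_approximate(p, q, points, eps):
    """True if p is a (1 + eps)-approximate nearest neighbor of q among points."""
    true_nn_distance = min(math.dist(x, q) for x in points)
    return math.dist(p, q) <= (1.0 + eps) * true_nn_distance

points = [(1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]   # hypothetical 2-D data points
q = (0.0, 0.0)

near_miss_ok = is_eps_approximate((1.05, 0.0), q, points, eps=0.1)
far_point_ok = is_eps_approximate((3.0, 0.0), q, points, eps=0.1)
print(near_miss_ok, far_point_ok)
```

With $\varepsilon = 0.1$ the search may return a point up to 10% farther than the true nearest
neighbor, which is what allows the tree search to prune cells early.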

5  Experiments  and  Evaluation 
5.1  Data 

We  use  ABC  news  videos  provided  by  TRECVID  2004  [9]  as  our  training  data  and  testing  data. 
Using  the  ground  truth  of  shot  boundaries  provided  by  TRECVID  2004,  we  extract  one  keyframe 
(the  middle  frame)  for  each  given  shot,  and  extract  their  keypoints  and  PCA-SIFT  descriptors. 
Table  1  describes  the  training  set  and  testing  set.  A  BBD  tree  is  built  using  all  these  PCA-SIFT 
descriptors.  Both  the  training  set  and  the  testing  set  contain  not  only  news  but  also  commercials. 

Table 1: Training set and testing set

                      Training set                 Testing set
Files                 19981210_ABCa.mpg            19981004_ABCa.mpg
                      19981221_ABCa.mpg            19981021_ABCa.mpg
                                                   19981109_ABCa.mpg
                                                   19981126_ABCa.mpg
Length                56 minutes and 47 seconds    1 hour 53 minutes and 40 seconds
Number of keyframes   844                          1,735
Number of keypoints   391,213                      818,997

The evaluation of our framework is conducted at the object level. Ground truth objects were
manually labeled and framed with rectangles by five students working independently. These
objects have diverse semantics, including human faces, superimposed texts, logos, buildings,
animation figures, banners, computers, maps, ties, etc. Only those objects that occurred in at least
two different keyframes were labeled. The resulting total number of labeled objects is 222. The
number of occurrences of objects varies from 2 to 32.

5.2  Method 

In  the  evaluation  of  our  object  searching  approach,  we  use  all  the  instances  of  all  the  objects 
labeled  in  the  ground  truth  as  queries  in  turn.  Results  are  delivered  as  ranked  frame  lists,  along 
with  rectangles  indicating  the  detected  matching  object  area.  For  each  query,  the  query  frame 
itself  is  automatically  filtered  out  of  the  result  list.  For  each  result  frame,  if  the  detected  object 
area  overlaps  an  instance  of  the  current  query  object,  it  is  counted  as  a  correct  result;  otherwise 
a  false  positive  is  reported. 
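
The correctness test for a result frame can be sketched as follows. This is an illustration under
assumptions: rectangles are represented as (x_min, y_min, x_max, y_max) tuples, and any overlap of
positive area counts, since no minimum overlap ratio is stated in the text.

```python
def rectangles_overlap(a, b):
    """True if rectangles a and b, given as (x_min, y_min, x_max, y_max), share area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_correct_result(detected, ground_truth_boxes):
    """A result frame is correct if the detected area overlaps any labeled instance."""
    return any(rectangles_overlap(detected, box) for box in ground_truth_boxes)

detected_area = (0, 0, 10, 10)
hit = is_correct_result(detected_area, [(5, 5, 15, 15)])     # overlapping instance
miss = is_correct_result(detected_area, [(10, 0, 20, 10)])   # boxes only touch at an edge
print(hit, miss)
```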

5.3  Results 

We present our results with three measures: (1) Mean Precision Recall Curve, (2) Mean R-Precision
and (3) Mean Average Precision over all objects [21]. To analyze the contributions of the energy
minimization algorithm and ANN to our SoftCBIR approach, we perform the following three
experiments: (1) with energy minimization and ANN (our proposed approach), (2) with ANN
and without energy minimization algorithm, (3) with energy minimization algorithm and without
ANN (using K-means instead). Fig. 2 illustrates the comparison between the three results with
the mean precision recall curves. In Fig. 2, the solid curve is for (1), the dashed curve is for (2),
and the dotted curve is for (3). It is clear that our approach combining both energy minimization
and ANN outperforms the other two, with the largest gap relative to the K-means variant (3).
Table 2 tabulates the detailed values on the curve of our approach. Table 3 compares the Mean
R-Precisions and Mean Average Precision of the three experimental results. For general purpose
object searching in a large-scale video database, the results of our approach are encouraging. In
this experiment, the K for K-means is chosen as 11700 because this value yields the highest results.
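
For reference, R-Precision and Average Precision for a single query can be computed as in the
sketch below, which uses the usual information retrieval definitions; these may differ in detail
from the formulation in [21], and the function names and the toy ranking are assumptions. The
mean values are then obtained by averaging over queries.

```python
def average_precision(ranked_relevance, n_relevant):
    """Mean of the precision values at the ranks where relevant frames appear.

    ranked_relevance: list of booleans in rank order.
    n_relevant: total number of relevant frames for the query.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0

def r_precision(ranked_relevance, n_relevant):
    """Precision at rank R, where R is the number of relevant frames."""
    if n_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:n_relevant]) / n_relevant

ranking = [True, False, True, False]   # hypothetical result list for one query
ap = average_precision(ranking, n_relevant=2)
rp = r_precision(ranking, n_relevant=2)
print(ap, rp)
```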

Figure 2: Comparison between approaches in Mean Precision Recall Curve


Table 2: Points on Mean Precision Recall Curve of our approach

Recall      0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50
Precision   1.00  0.95  0.91  0.86  0.82  0.78  0.74  0.71  0.68  0.65  0.62

Recall      0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
Precision   0.60  0.57  0.55  0.52  0.50  0.48  0.45  0.43  0.40  0.38

Fig. 3 to Fig. 6 show some examples of valid frames returned by SoftCBIR. The results in
Fig. 3, 4 and 6 are perfect, in the sense that all the relevant frames listed in ground truth are
delivered at the top of the ranking. Fig. 5 shows two false positives: Result 11 and Result 12.
Fig. 7 shows four queried objects that failed to return any correct results. From Fig. 7 we can
see that our approach does not work well for the following types of objects: 1) Objects with no
or few keypoints because of characteristics of the object such as blur, low resolution or shading,
for example in (a)-(c). 2) Objects with perspective change beyond 30 degrees, for example, (c).
3) Objects with too high-level semantic meaning, for example, (d).

Table 3: Comparison in Mean R-Precision and Mean Average Precision

                                         Mean R-Precision   Mean Average Precision
With energy minimization and ANN         0.478              0.505
With ANN, without energy minimization    0.411              0.452
With energy minimization, without ANN    0.206              0.242

Figure 3: Search results for the object “Carole Simpson” (query image and Results 1-18)

Figure 4: Search results for the object “CIA seal” (query image and Results 1-12)

Figure 5: Search results for the object “ABCNEWS logo” (query image and Results 1-13)

Figure 6: Search results for the object “Superimposed text in Aetna advertisement” (query image and Results 1-8)

Figure 7: Unsuitable objects (panels (a)-(d))

5.4 Speed

We tested the system speed using all the instances of all the objects in the ground truth as
queries in turn. Each query is searched in the whole testing set of 1,735 images.
On a computer with a 2.4 GHz Pentium 4 CPU and 1 GB of RAM, the average running time of one
object searching task is 28.71 seconds, the minimum time is less than 3 seconds, and the maximum
time is 170 seconds. On the same hardware, if David Lowe’s raw SIFT descriptors
are used, the average running time of one object-searching task is 45 hours. This is very slow
because of the constant $c_{d,\epsilon} \le d \lceil 1 + 6d/\epsilon \rceil^d$ in Section 4, which
increases exponentially with d (the dimension of descriptors). This comparison shows the advantage
of combining the smaller number of feature dimensions provided by PCA-SIFT with efficient K-NN
search in real-world applications.
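
To give a feel for how quickly the bound on the ANN query constant grows with the descriptor dimension, the following sketch evaluates the base-10 logarithm of d⌈1 + 6d/ε⌉^d. The dimensions 36 and 128 (typical sizes for PCA-SIFT and raw SIFT descriptors) and the value ε = 1 are illustrative assumptions, not settings reported in this paper, and the function name is hypothetical. The quantity is only an upper bound, so it indicates a trend rather than an actual running time.

```python
import math

def log10_ann_constant(d, eps):
    """log10 of the upper bound d * ceil(1 + 6d/eps)**d on the ANN query constant."""
    return math.log10(d) + d * math.log10(math.ceil(1 + 6 * d / eps))

# Assumed, illustrative dimensions: 36 for PCA-SIFT, 128 for raw SIFT.
for d in (36, 128):
    print(d, round(log10_ann_constant(d, eps=1.0), 1))
```

Working in the log domain avoids building the very large integers that the bound produces for high-dimensional descriptors.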

6  Conclusions  and  future  work 

In this paper, we proposed a new keypoint-based object searching approach called SoftCBIR.
Our main contributions are an energy minimization algorithm that enforces geometric consistency
and descriptor matching between two groups of keypoints, and the use of Approximate Nearest
Neighbor (ANN) search for keypoint matching instead of K-means. Experiments on ABC news videos
with 1,735 keyframes in the testing set demonstrate the effectiveness of our
approach. We also pointed out object types that are not suitable for our approach.

In  future  work,  we  will  explore  the  following  directions:  1)  Expand  the  current  data  set  to  a 
larger  subset  of  the  TRECVID  videos,  2)  Try  to  improve  keypoint  detection  and  representation,  to 
increase  robustness  to  3D  projection  and  non-rigid  transformations,  and  3)  Provide  mechanisms 
to  bridge  the  gap  between  the  text  queries  of  TRECVID  and  the  image  queries  of  SoftCBIR,  for 
example  by  using  internet  search  to  provide  images  that  best  illustrate  the  concepts  of  the  text 
queries. 

7  Appendix  1 

We also developed a heuristic approach to check neighborhood consistency and geometry
consistency. This approach is faster than the energy minimization method. Its drawback is
that it assumes that the scale changes between the query region and the candidate regions are
not large. The heuristic approach consists of a neighborhood consistency filter and a geometry
consistency filter.

7.1  Neighborhood  consistency  filter 

For object matching, only those matched keypoint pairs with enough support from their
neighborhoods are reliable. Therefore two filters are employed to check neighborhood consistency,
one for filtering out neighborhood outliers and one for filtering out keypoints that do not satisfy
the local mapping constraints. With these two filters, we effectively filter out false alarms or give
them low scores for the ranking, and refine the scores and object locations for the remaining
correct results.

7.1.1  Filtering  out  neighborhood  outliers 

Among the matched keypoints, a neighborhood outlier in the result image is unreliable. Although
the KNN-based PCA-SIFT matching is generally good, there are still errors in individual keypoint
pairings. Some of these pairings lie in the result image by themselves, without supporting neighbor
pairings, and thus should be filtered out. Fig. 8 and Fig. 9 show such an example. They are
results of object searching in a keyframe database (440 keyframes) from a video
in TRECVID 2004. In Fig. 8, (a) is the query image, with the logo of a detergent brand as the
query object. The top seven ranked object searching results without neighborhood consistency
filtering are shown in (b) to (h). The rectangles locate the detected object and the white points
indicate the positions of matched keypoints. Fig. 8 also shows their scores. Only
(b) and (h) are correct results; the others are false alarms. Even in (b) and (h), the detected
object regions are too large compared to the true object. Fig. 9 shows the top four ranked results
after filtering out neighborhood outliers. Here the first two results are the correct ones, and the
score of the third result is much lower than those of the first two. Furthermore, the detected
object regions in Fig. 9 (a) and (b) are much better than those in Fig. 8 (b) and (h).


Figure 8: Object search example for criteria without neighborhood consistency filter. Recovered panel scores: (e) 44, (f) 37, (g) 35, (h) 34.

Figure 9: Object search example for criteria with filtering out of neighborhood outliers. Panel scores: (a) 52, (b) 30, (c) 16, (d) 6.

We now turn to describing the algorithm in detail. Suppose the set of matched keypoints is S
in a result image. For a keypoint (x, y) in S, we compute the local density of matched keypoints
in the image plane around (x, y) using (6). If the density is less than a threshold, keypoint (x, y)
is removed from the matched keypoint set.

$$\mathrm{density}(x, y) = e^{-\sum_{i=1}^{n} \sqrt{(x_i - x)^2 + (y_i - y)^2}} \qquad (6)$$

where $(x_1, y_1), \ldots, (x_n, y_n)$ are the n keypoints in S closest to (x, y) in 2D Euclidean
distance, with n being an adjustable system parameter. In our current system, n = 2.
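
The outlier filter described above can be sketched as follows. This is a minimal illustration, assuming that the density is the exponential of the negated sum of distances to the n nearest matched keypoints; the function names and the threshold value are hypothetical, and densities are computed once against the unfiltered set rather than iteratively.

```python
import math

def local_density(point, matched, n=2):
    """exp(-sum of distances) to the n nearest other matched keypoints."""
    x, y = point
    dists = sorted(math.hypot(px - x, py - y)
                   for (px, py) in matched if (px, py) != (x, y))
    return math.exp(-sum(dists[:n]))

def filter_outliers(matched, threshold, n=2):
    """Keep only keypoints whose local density reaches the threshold."""
    return [p for p in matched if local_density(p, matched, n) >= threshold]

# Three clustered keypoints and one isolated keypoint (made-up coordinates).
points = [(0, 0), (1, 0), (0, 1), (50, 50)]
print(filter_outliers(points, threshold=math.exp(-5)))
```

With these made-up coordinates the isolated keypoint at (50, 50) has no nearby support, so its density falls far below the threshold.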

7.1.2  Examining  local  mapping 

Filtering out neighborhood outliers is the first step of neighborhood consistency filtering. After
removing neighborhood outliers, there may still be matches between keypoint pairs which do not
satisfy neighborhood consistency. For each matched keypoint pair, we examine the local mapping
around them in their respective neighborhoods. If there are not enough supporting pairings from
their neighborhoods for their matching, this pair will be filtered out. Fig. 10 and Fig. 11 show such
an example.


Figure 10: Object search example without examining local mapping. Panels: (a) Query image; (b) Image 24; (c) Image 46; (d) Result 1, Image 46, score: 478; (e) Result 2, Image 24, score: 310.

Figure 11: Object search example while examining local mapping. Panels: (a) Result 1, Image 24, score: 212; (b) Result 2, Image 46, score: 95.

Fig. 10 and Fig. 11 are also results of object searching in the same keyframe database as
Fig. 8 and Fig. 9. In Fig. 10, (a) is the query image, whose query object is the title of The New
York Times in gothic font; (b) and (c) are two keyframes in the database with frame IDs 24 and
46 respectively. Image 24 and the query image share exactly the same object, “The New York
Times”, while Image 46 and the query image share only similar objects, namely text in the same
gothic font. The top two ranked object searching results without examining local mapping are
shown in (d) and (e), where Image 46 is ranked first and Image 24 is ranked second. This result
is encouraging because the scores of Image 46 and Image 24 are higher than those of other frames
in the database, so we are successfully capturing one important feature of the object, the gothic font.
On the other hand, the result is not perfect because Image 46 gets a higher score than Image 24,
while the opposite should be true. Fig. 11 shows the top two ranked object searching results after
examining local mapping in (a) and (b). Here Image 24 is ranked first and Image 46 is ranked
second, and Image 24 gets a much higher score than Image 46. This result better reflects our
intuition. The location and size of the detected object region in Fig. 11 (a) are also much better
than in Fig. 10 (e).

As for the detailed algorithm, for a result image, suppose the set of matched keypoints in the
query image is S, and that in the result image it is S'. For each matching pair of keypoints (K,
K'), with K in S and K' in S', suppose their scales are $\sigma$ and $\sigma'$ respectively. We
generate adaptively sized neighborhoods (denoted N and N') around K and K' respectively in their images.

$$N = \{(x_i, y_i) \mid 1 \le i \le n\}$$

$$N' = \{(x'_i, y'_i) \mid 1 \le i \le n'\} \qquad (7)$$

$$\frac{n'}{n} = \frac{\sigma'}{\sigma}$$

where $(x_i, y_i)$ are the closest n keypoints to K in S in 2D Euclidean distance, and $(x'_i, y'_i)$
are the closest n' keypoints to K' in S' in 2D Euclidean distance. Min{n, n'} is an adjustable system
parameter; in our current system, Min{n, n'} = 5. That is, if $\sigma < \sigma'$, then n = 5 and n'
is computed by (7); otherwise n' = 5 and n is computed by (7). If the number of matched pairs
$(K_1, K_2)$, with $K_1$ in N and $K_2$ in N', exceeds a threshold, we conclude that there is a local
mapping between K and K' and we keep them; otherwise we delete (K, K') from the matched keypoint
set. The threshold here is also an adjustable system parameter; currently it is 2.
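
The local mapping test can be sketched as below. This is an illustration under stated assumptions: matches are stored so that the i-th query keypoint is paired with the i-th result keypoint, the larger neighborhood size is obtained by rounding the scale-proportional value, and all function and parameter names are hypothetical.

```python
import math

def nearest(idx, pts, count):
    """Indices of the `count` keypoints closest to pts[idx] (excluding idx)."""
    x, y = pts[idx]
    others = [j for j in range(len(pts)) if j != idx]
    others.sort(key=lambda j: math.hypot(pts[j][0] - x, pts[j][1] - y))
    return set(others[:count])

def local_mapping_filter(query_pts, result_pts, scales, min_n=5, threshold=2):
    """query_pts[i] is matched to result_pts[i]; scales[i] = (sigma, sigma_prime).
    Returns the indices of the matches that have enough neighborhood support."""
    kept = []
    for i, (s, s2) in enumerate(scales):
        if s < s2:
            n, n2 = min_n, round(min_n * s2 / s)   # n' = n * sigma' / sigma
        else:
            n, n2 = round(min_n * s / s2), min_n   # n = n' * sigma / sigma'
        # A supporting pair has its query end in N and its result end in N'.
        support = nearest(i, query_pts, n) & nearest(i, result_pts, n2)
        if len(support) > threshold:
            kept.append(i)
    return kept
```

Because both ends of a match share one index, counting pairs with one end in N and the other in N' reduces to a set intersection.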

7.2  Geometry  consistency  filter 

After neighborhood consistency filtering, we have more accurate matched keypoint sets in each
result image. But there still exist some sets of matched points that do not satisfy geometry
consistency. Computing the homography by iteration is a computationally expensive approach.
Fortunately we can make use of the local scale and orientation features of keypoints to simplify
the checking of geometry consistency. For each individually matched keypoint pair, $K_1$ in the query
image and $K_2$ in the result image, suppose $(\sigma_1, \theta_1)$ records the scale and orientation
of $K_1$, and $(\sigma_2, \theta_2)$ records those of $K_2$; we compute the scale change ratio $\sigma$
and orientation change angle $\theta$.

$$\sigma = \sigma_2 / \sigma_1 \qquad (8)$$

$$\theta = \theta_2 - \theta_1 \qquad (9)$$

These two parameters $(\sigma, \theta)$ are the scaling (zooming) parameter and rotation parameter for the
transformation from $K_1$ to $K_2$. If the two groups of keypoints really represent the same object,
they should conform to a single planar homography transformation (assuming their depth is small
compared to their distance to the camera), so the $(\sigma, \theta)$ set should be compact in the 2-dimensional
parameter space. Therefore, we filter out those individual keypoint pair matches that are sparsely
distributed in the two-dimensional parameter space $(\sigma, \theta)$. As for the detailed algorithm, we build
a 2D parameter vector $(\log \sigma, \theta)$ and normalize it into $(N_\sigma, N_\theta) \in [0,1] \times [0,1]$ as described by (10)
and (11).

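$$N_\sigma = \begin{cases} 0, & \text{if } \sigma < 1/5 \\ \dfrac{\log \sigma + \log 5}{2 \log 5}, & \text{if } 1/5 \le \sigma \le 5 \\ 1, & \text{if } \sigma > 5 \end{cases} \qquad (10)$$

$$N_\theta = \frac{\theta + \pi}{2\pi}, \quad \text{where } \theta \in [-\pi, \pi] \qquad (11)$$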

For a parameter point $(N_\sigma, N_\theta)$, we compute the local density of parameter points around it
in the parameter plane as (12) and (13). If the density is lower than a threshold, we filter it out
from the matched keypoint set.

$$\mathrm{density}(N_\sigma, N_\theta) = e^{-\sum_{i=1}^{m} D((N_\sigma, N_\theta), (N_\sigma^i, N_\theta^i))} \qquad (12)$$

$$D((N_\sigma, N_\theta), (N_\sigma^i, N_\theta^i)) = \sqrt{(N_\sigma^i - N_\sigma)^2 + \min\{(N_\theta^i - N_\theta)^2, (N_\theta^i - N_\theta \pm 1)^2\}} \qquad (13)$$

in which $D(\cdot,\cdot)$ is a distance function in the $(N_\sigma, N_\theta)$ parameter plane. Here
$(N_\sigma^1, N_\theta^1), \ldots, (N_\sigma^m, N_\theta^m)$ are the m parameter points in S closest to
$(N_\sigma, N_\theta)$ under the distance function $D(\cdot,\cdot)$, with m as
an adjustable system parameter. In our current system, m = 5. The special distance function in
(13) is designed to account for the fact that the $[-\pi, \pi]$ range is circular. (In the normalized
range, 1 represents $2\pi$.)

Fig. 12 shows such an example. In Fig. 12, (a) is the query image, with the word “YES” as the
query object; (b) and (c) are two tied object-searching results without the geometry consistency filter.
They both have five matched keypoints, but they present different patterns in the 2D parameter
space $(N_\sigma, N_\theta)$, as shown in (d). With a geometry consistency filter, result 2 will have a higher rank than
result 1, because the parameters of matched keypoints in result 2 are much more compact in the 2D
parameter space than those in result 1.
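
The geometry consistency test can be sketched as follows. This is a minimal illustration, assuming that the scale axis is normalized linearly in log scale between 1/5 and 5 and that the density is the exponential of the negated sum of the m smallest distances; the function names and the threshold value are hypothetical.

```python
import math

def normalize(sigma, theta):
    """Map (scale ratio, rotation) into [0,1] x [0,1]; scale is clipped to [1/5, 5]."""
    if sigma < 1 / 5:
        n_sigma = 0.0
    elif sigma > 5:
        n_sigma = 1.0
    else:
        n_sigma = (math.log(sigma) + math.log(5)) / (2 * math.log(5))
    theta = (theta + math.pi) % (2 * math.pi) - math.pi  # wrap into [-pi, pi)
    return n_sigma, (theta + math.pi) / (2 * math.pi)

def distance(p, q):
    """Distance in the parameter plane; the angle axis wraps around at 1."""
    d_theta = abs(q[1] - p[1])
    d_theta = min(d_theta, 1 - d_theta)
    return math.hypot(q[0] - p[0], d_theta)

def density(i, params, m=5):
    """exp(-sum of distances) to the m nearest other parameter points."""
    dists = sorted(distance(params[i], params[j])
                   for j in range(len(params)) if j != i)
    return math.exp(-sum(dists[:m]))

def geometry_filter(pairs, threshold, m=5):
    """pairs: list of ((sigma1, theta1), (sigma2, theta2)); returns kept indices."""
    params = [normalize(s2 / s1, t2 - t1) for (s1, t1), (s2, t2) in pairs]
    return [i for i in range(len(params)) if density(i, params, m) >= threshold]
```

Taking the smaller of the direct and wrapped angle differences is equivalent to the min over the two squared terms in the distance function, since both normalized angles lie in [0, 1].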


7.3  Ranking 

After neighborhood consistency filtering and geometry consistency filtering, the remaining groups
of matched keypoints in the result frames are considered to be truly matched objects. At this
stage we again adopt a very simple scheme: we rank frames by the number of matched keypoints that
survived these filtering steps. We deliver this ranked list as the object searching result.

Figure 12: Object search example for criteria with geometry consistency filter. Panels: (a) Query image; (b) Result 1; (c) Result 2; (d) Parameter distribution (plot title: Parameter distributions; axis label: Normalized scale ratio).

8  Appendix  2 

For object searching in videos, we propose a framework for shot boundary detection and keyframe
extraction. It handles detection of cuts, dissolves and fades. It is based on pixel-to-neighbor
image differences computed in the videos. The cut detection applies the so-called second-max
ratio criterion in a sequential image buffer. The dissolve detection is based on a skipping image
difference and a linearity error in a sequential image buffer. We also filter out the effects of camera
flash. For each detected shot, we extract the middle frame as its representative image. Once
keyframes are extracted, object searching in videos is equivalent to object searching in the database
of keyframe images.

8.1  Introduction 

Shot boundary detection is an essential elementary component of video analysis. Many
different shot boundary detection methods exist in the literature. In this appendix, we explain
the methods we use in our video research. They are designed in a straightforward way from the
pixel-to-neighbor image differences in video sequences.

Generally, there are three kinds of shot boundaries: cut, dissolve and wipe. A cut is an abrupt
transition between shots which is naturally formed by the video capturing process. A dissolve is
a gradual transition between shots, an effect added by video editors in which two adjacent
shots are partly overlapped while the frame intensities of the first shot are decreased to zero and
the frame intensities of the second shot are increased from zero. In fade-in and fade-out, the
two shots are not overlapped but the variations of frame intensities in the two adjacent shots are
similar to those in a dissolve. Therefore we use the same detection method and complete it with a
post-processing step to account for the fact that the shots are not overlapped. A wipe is a digital
video effect, also generated by video editors, that can take many different forms. In a wipe, one
new shot pushes away an old shot. In this appendix we only describe algorithms for cut detection
and for dissolve and fade detection.


8.2  Cut  Detection 

8.2.1  Second-Max  Ratio  Criterion 


The image difference between two adjacent frames can be a good cue for detecting cuts. But
motion within one shot may be so large that it also causes a noticeable image difference and
can be confused with a cut. To deal with motion, we use the second-max ratio criterion in the image
difference sequence to detect cuts. The second-max ratio, R(t), is defined as

$$R(t) = \frac{d(t)}{\max_{t' \in [t - w_1,\, t + w_1],\ t' \ne t} d(t')} \qquad (14)$$

in which $d(t) = \|I(t+1) - I(t)\|$ is the pixel-to-neighbor image difference between two adjacent
frames and is described in the next section, $w_1$ is the half-width of a sliding window, and $\max d(t')$
is the maximum among all image differences between adjacent frames in the sliding window,
excluding the one at time t that appears in the numerator of the expression. Therefore, when
that frame is located at a maximum of d(t), the denominator selects the second maximum in the
sliding window, since the first one is excluded from consideration. If the second maximum is not
small, as is likely for a high level of motion, then the ratio remains small. In case of a cut, there is
no large second maximum, so the ratio becomes large when frame t is just past the cut. Fig. 13
demonstrates the effectiveness of R(t). In this figure, there is a segment (around position c) with
large motion. It produces a series of large responses in d(t), but in R(t) these are correctly
eliminated. All peaks in R(t) correspond to real cuts.


Figure 13: Effectiveness of second-max ratio criterion in detecting cuts (plots of d(t) and R(t)). Note that the peaks due
to motion that appear in d(t) do not appear in R(t)
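
The second-max ratio can be sketched as follows for a list of precomputed adjacent-frame differences. The truncation of the window at the ends of the sequence and the handling of a zero denominator are assumptions added for the sketch, and the function name is hypothetical.

```python
def second_max_ratio(d, w1):
    """R(t) = d(t) / max of d over the window [t - w1, t + w1], excluding t itself."""
    ratios = []
    for t in range(len(d)):
        lo, hi = max(0, t - w1), min(len(d), t + w1 + 1)
        others = [d[u] for u in range(lo, hi) if u != t]
        peak = max(others) if others else 0.0
        if peak > 0:
            ratios.append(d[t] / peak)
        else:
            # Assumption: an isolated non-zero difference counts as an infinite ratio.
            ratios.append(float("inf") if d[t] > 0 else 0.0)
    return ratios

# A single cut (value 10) followed by a high-motion segment (values 5 and 6).
differences = [1, 1, 1, 10, 1, 1, 1, 5, 6, 5, 6, 5]
print(second_max_ratio(differences, w1=2))
```

In this made-up sequence the isolated spike keeps a large ratio, while the sustained large differences of the motion segment divide each other and stay near 1.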

8.2.2  Image  Difference  Computation 

There are many ways to compute the difference between two images. The following three are
popular:

1. Pixel-to-pixel difference

2. Pixel-to-neighborhood difference

3. Histogram difference

We  can  think  of  the  image  difference  as  a  distance  between  two  vectors  in  a  feature  space.  In 
the  first  and  second  methods,  the  features  are  in  the  pixel  space,  with  the  color  of  each  pixel  as 
one  component  of  each  feature.  In  the  third  method  the  features  are  in  histogram  space,  with 
each  histogram  bin  a  component  of  each  feature.  The  first  method  is  not  robust  to  motion  in 
videos.  The  second  method  is  an  advanced  version  of  the  first  method  with  a  degree  of  motion 
compensation.  The  third  method  can  be  computed  by  the  Euclidean  distance  or  any  other  defined 
histogram  distance  in  the  histogram  space.  One  of  the  drawbacks  of  the  third  method  is  that  it 
totally  discards  the  spatial  distribution  of  the  images.  Our  experiments  support  our  analysis  that 
the  second  method,  pixel-to-neighborhood  difference,  is  the  best  for  evaluating  image  difference. 
In  our  implementation  we  compute  the  pixel-to-neighborhood  difference  as 


$$d(t) = \|I(t+1) - I(t)\| = c \sum_{i,j} \left( \min_{k \in [i - w_2,\, i + w_2],\ l \in [j - w_2,\, j + w_2]} \left( \sum_{m \in \{R,G,B\}} \left| I_{k,l}(t+1)[m] - I_{i,j}(t)[m] \right| \right) \right) \qquad (15)$$

in which c is a scalar constant, $w_2$ is the half-width of the motion compensation search window, and
$I_{i,j}(t)[m]$ is the value of channel m at pixel position (i, j) of the frame image
I(t), with $m \in \{R, G, B\}$.
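
A direct, unoptimized sketch of the pixel-to-neighborhood difference is given below. Frames are assumed to be nested lists of (R, G, B) tuples, and clipping the search window at the image border is an assumption of the sketch; the function name is hypothetical.

```python
def pixel_to_neighbor_diff(prev, nxt, w2=1, c=1.0):
    """c * sum over pixels (i, j) of prev of the smallest RGB absolute difference
    to any pixel of nxt inside the (2*w2+1) x (2*w2+1) window centred on (i, j)."""
    rows, cols = len(prev), len(prev[0])
    total = 0
    for i in range(rows):
        for j in range(cols):
            best = None
            for k in range(max(0, i - w2), min(rows, i + w2 + 1)):
                for l in range(max(0, j - w2), min(cols, j + w2 + 1)):
                    diff = sum(abs(a - b) for a, b in zip(nxt[k][l], prev[i][j]))
                    if best is None or diff < best:
                        best = diff
            total += best
    return c * total
```

With w2 = 0 this reduces to the pixel-to-pixel difference, which shows how the search window provides the motion compensation mentioned in the text.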

8.3  Dissolve  Detection 

In dissolve detection, we cannot use the image difference between two adjacent frames, because
it is small during the dissolve. But there will be a large image difference between two frames
if we skip an interval (such as a 25-frame interval). This skipping image difference can be used
as a cue for dissolve detection, but it is not a sufficient criterion for finding a dissolve: very
often a peak of the skipping image difference in a sequence is not a dissolve (it can be a
cut, a wipe, a shot with large motion, or a combination of these events). Therefore we need
to add other criteria for reliable dissolve detection. Another criterion we can use is the degree of
linearity of a sequence of frames. This is because, to generate a dissolve effect,
most video editing processors use a linear combination of the signal before the dissolve transition
(denoted signal A) and the signal after the transition (denoted signal B). Thus in this model
the dissolve is a simultaneous fade-out of signal A and fade-in of signal B. This implies that,
during the dissolve transition, images change linearly. Hence the degree of linearity can be used
as a second criterion for dissolve detection, based on how dissolves are generated. Technically, it can be
evaluated from the normalized linear error within a sequence of frames. We propose a method for
dissolve detection which combines the above two criteria. Suppose we currently have a sequence of
w images which is a segment in the video sequence starting with I(t) and ending with I(t + w - 1).
We define the current skipping image difference value as

$$D(t) = \|I(t+w-1) - I(t)\| = c \sum_{i,j} \left( \min_{k \in [i - w_2,\, i + w_2],\ l \in [j - w_2,\, j + w_2]} \left( \sum_{m \in \{R,G,B\}} \left| I_{k,l}(t+w-1)[m] - I_{i,j}(t)[m] \right| \right) \right) \qquad (16)$$


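We define the current normalized linear error along this part of the video as

$$LE(t) = \sum_{l=2}^{w-1} \frac{\left\| I(t+l-1) - \left( \left(1 - \frac{l-1}{w-1}\right) I(t) + \frac{l-1}{w-1}\, I(t+w-1) \right) \right\|}{\left\| I(t+w-1) - I(t) \right\|} \qquad (17)$$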

We detect a dissolve by the simultaneous presence of a peak of D(t) and a valley of LE(t). Fig.
14 shows an example of detected dissolves from a news video. Here we have four dissolves that
all satisfy our dissolve model. Fig. 14 also illustrates that we can easily estimate the start frame
number and end frame number of a dissolve transition.


Figure  14:  Four  dissolves  with  simultaneous  peak  of  D(t)  (darker  red  curve)  and  valley  of  LE(t ) 
(lighter  green  curve) 
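
The two dissolve features can be sketched on a window of w frames as follows. For brevity the sketch assumes that frames are flat lists of numbers and replaces the pixel-to-neighborhood norm by a plain sum of absolute differences; the interpolation weights (equally spaced between the first and last frame of the window) and the function names are also assumptions.

```python
def l1(a, b):
    """Sum of absolute differences, a simplified stand-in for the image norm."""
    return sum(abs(x - y) for x, y in zip(a, b))

def skipping_difference(frames):
    """D(t) for a window of w frames: distance between the last and first frame."""
    return l1(frames[-1], frames[0])

def linear_error(frames):
    """LE(t): summed deviation of the interior frames from the linear blend of
    the window's endpoints, normalized by the endpoint distance."""
    w = len(frames)
    first, last = frames[0], frames[-1]
    span = l1(last, first)
    if span == 0:
        return float("inf")  # assumption: a static window is never a dissolve
    err = 0.0
    for l in range(1, w - 1):
        a = l / (w - 1)
        blend = [(1 - a) * p + a * q for p, q in zip(first, last)]
        err += l1(frames[l], blend)
    return err / span
```

A window that lies inside an ideal linear dissolve gives a large skipping difference together with a linear error near zero, whereas a cut inside the window gives a large value for both.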

In (17), we use the linearity assumption in RGB color space. One concern could be that in
analog television and in MPEG encoding, other color spaces are used. In which color space is
the dissolve signal linear? In fact, a signal that is linear in one of these color spaces is also
linear in the others, because the color spaces are related to one another by linear transforms.
We also did some experiments to show this equivalence. For example, Fig. 15 shows the LE(t)
curves for RGB color space and YUV color space; their shapes are similar. In our implementation,
we compute the linearity errors in the RGB color space.


Figure  15:  LE(t )  in  RGB  color  space  (left  picture)  and  YUV  color  space  (right  picture) 


8.4  Elimination  of  Flash 

Camera flashes are frequent in videos, especially in news videos. A flash produces a high-intensity
light of very short duration. It causes false alarms in cut detection, because it dramatically
increases the global brightness of one or more frames. Our method to eliminate false
alarms caused by flash from candidate cuts uses the following two facts: (1) the duration of a flash
is very short, generally only one or two frames; (2) the images before a flash are similar to the
images after the flash, if they are in the same shot. Following the method described above, we
get candidate cuts by computing R(t). These candidate cuts are then filtered by our post-processing
module to eliminate false alarms caused by flash. For a candidate cut position, denoted t, we
examine the following four values: $\|I(t+1) - I(t-2)\|$, $\|I(t+1) - I(t-1)\|$, $\|I(t+2) - I(t)\|$,
$\|I(t+3) - I(t)\|$. If any of these four values is less than a threshold, the candidate is considered
to be a false alarm caused by flash and is filtered out from the candidate cuts. This algorithm performs
quite well in experiments, and filters out most of the false alarms caused by flash. Notice also
that the algorithm still works in the special case where a flash happens to coincide with a real cut
boundary: it will not filter out the cut, because the images before the cut and those
after the cut have large differences.
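
The flash test on a candidate cut can be sketched as follows. The sketch uses scalar brightness values as stand-ins for frames, and the helper `brightness_diff`, the function names and the threshold value are hypothetical; in the real system the difference would be the image difference used elsewhere in this appendix.

```python
def brightness_diff(a, b):
    """Stand-in for the image difference ||I(a) - I(b)|| on scalar 'frames'."""
    return abs(a - b)

def is_flash(frames, t, threshold, diff=brightness_diff):
    """True if candidate cut t looks like a flash, i.e. some frame shortly after
    the candidate is still similar to some frame shortly before it."""
    pairs = [(t + 1, t - 2), (t + 1, t - 1), (t + 2, t), (t + 3, t)]
    for a, b in pairs:
        in_range = b >= 0 and a < len(frames)
        if in_range and diff(frames[a], frames[b]) < threshold:
            return True
    return False

flash = [10, 10, 10, 10, 200, 10, 10, 10]        # one bright frame, same shot
real_cut = [10, 10, 10, 10, 200, 200, 200, 200]  # content changes for good
print(is_flash(flash, 3, threshold=20), is_flash(real_cut, 3, threshold=20))
```

Pairs that fall outside the sequence are skipped, which is an assumption about how the boundaries of the video are handled.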

Table 4: Experimental Results

Run-id  Recall   Precision  Recall   Precision  Recall        Precision     F-recall      F-precision
        for all  for all    for cut  for cut    for dissolve  for dissolve  for dissolve  for dissolve
D0      0.802    0.847      0.900    0.877      0.595         0.765         0.567         0.792
D1      0.816    0.833      0.900    0.880      0.639         0.719         0.536         0.795
D2      0.809    0.842      0.900    0.877      0.617         0.750         0.568         0.795
D3      0.846    0.693      0.923    0.733      0.685         0.599         0.554         0.741
D4      0.731    0.908      0.864    0.921      0.450         0.859         0.656         0.787

8.5 Merging Adjacent Fade-Out and Fade-In

Fade-out and fade-in are very common in videos. Fade-out is a dissolve from an image (denoted
image A) to a black frame, while fade-in is a dissolve from a black frame to an image (denoted
image B). In most cases, a fade-out is followed by a fade-in. In the framework introduced
above, we get two dissolve transitions in this case: the preceding fade-out and the
subsequent fade-in. But this is not a reasonable output. In the shot boundary detection
output, we should deliver only one dissolve from image A to image B. Therefore we have a
post-processing module to merge adjacent fade-out and fade-in into one dissolve transition. In this
post-processing module, for a candidate dissolve transition, we compute the ratio of the number
of black frames during the transition to the total number of frames of the transition. If
this ratio is over 0.5, we consider the transition to be a fade-in or fade-out. We merge two such
close detected fade-in / fade-out transition components into one new dissolve transition. In our
implementation, a frame is declared to be a black frame if at least 90% of all the pixels are such
that R < 25, G < 25 and B < 25.
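
The black-frame test and the merging step can be sketched as follows. Frames are assumed to be lists of (R, G, B) tuples and transitions to be (start, end, is_fade) tuples sorted by start frame. The parameter `max_gap`, which decides when two fades are close enough to merge, is a hypothetical parameter since the text does not quantify closeness, and the function names are also hypothetical.

```python
def is_black_frame(pixels, level=25, fraction=0.9):
    """Black if at least 90% of the pixels have R, G and B all below `level`."""
    pixels = list(pixels)
    dark = sum(1 for r, g, b in pixels if r < level and g < level and b < level)
    return dark >= fraction * len(pixels)

def is_fade(transition_frames, ratio=0.5):
    """A transition is a fade-in or fade-out if over half of its frames are black."""
    black = sum(1 for f in transition_frames if is_black_frame(f))
    return black > ratio * len(transition_frames)

def merge_fades(transitions, max_gap):
    """Merge two fades separated by at most `max_gap` frames into one dissolve."""
    merged = []
    for start, end, fade in transitions:
        if merged and fade and merged[-1][2] and start - merged[-1][1] <= max_gap:
            # The merged result is an ordinary dissolve from image A to image B.
            merged[-1] = (merged[-1][0], end, False)
        else:
            merged.append((start, end, fade))
    return merged
```

Marking the merged transition as a non-fade prevents a third nearby fade from being absorbed into the same dissolve.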

8.6  Implementation 

In our implementation, we combined our algorithms with an MPEG-1 decoder. During decoding,
we keep a buffer of 25 sequential frames, implemented as a cyclic pointer array. Each pointer in the
array points to an image structure. Every time a new frame is decoded, the frame buffer is updated
correspondingly and the features d(t), R(t), D(t) and LE(t) are computed based on the current
frame buffer. The update rule is that the newly decoded frame enters the buffer and pushes
out the pointer for the oldest frame. The array is called cyclic because if the last decoded frame
was placed at the end of the array, the newly decoded frame will be placed at the
start position of the array. This array is very convenient for memory management, and the only
overhead is a pointer pointing to the currently processed position in the buffer.
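
The cyclic frame buffer can be sketched as follows. The class and method names are hypothetical, and Python object references play the role of the pointers to image structures.

```python
class FrameBuffer:
    """Fixed-size cyclic buffer holding the most recent decoded frames."""

    def __init__(self, size=25):
        self.slots = [None] * size
        self.pos = 0      # slot that the next decoded frame will overwrite
        self.count = 0    # number of frames stored so far (at most size)

    def push(self, frame):
        """Store a newly decoded frame, overwriting the oldest one when full."""
        self.slots[self.pos] = frame
        self.pos = (self.pos + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def frames(self):
        """Return the stored frames in decoding order, oldest first."""
        size = len(self.slots)
        start = (self.pos - self.count) % size
        return [self.slots[(start + k) % size] for k in range(self.count)]
```

Only the write position moves when a frame arrives, so no frame data is copied or shifted inside the buffer.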

8.7  Experimental  Results 

We submitted five runs to TRECVID 2004. Our run IDs are D0, D1, ..., D4. They
are all obtained by the same algorithms for feature extraction and shot boundary detection decision.
The five runs differ only in the values of the parameters of
our algorithms, including the w, $w_1$, $w_2$ parameters introduced above, the thresholds for R(t), and
the thresholds for finding the peaks of D(t) and valleys of LE(t). The results are shown in Table
4. They show the tradeoff between precision and recall. F-recall and F-precision in Table 4 are
frame-based recall and frame-based precision respectively.

A more thorough analysis shows that 29% of all missed gradual transitions are caused
by wipes. Wipes are difficult to detect effectively in videos because they are of various types
and the pixel-to-neighbor image differences have different properties for different styles of wipes.
We also find that 25% of all missed cut transitions are caused by extremely short dissolves. In
the TRECVID 2004 data, there are many extremely short dissolves involving only three frames.
These transitions are detected as dissolves in our submission, but in the TRECVID ground truth,
they are counted as cuts.

8.8 Future Work

Papers submitted by other participants propose different methods for shot boundary
detection. Tsinghua University [26], RMIT University [27], and FX Palo Alto Lab [28] obtained
the best results. Tsinghua University considers motion information in shot boundary detection.
RMIT University proposes an impressive PrePostRatio to detect gradual transitions. FX Palo
Alto Lab views shot boundary detection as a general supervised classification problem rather than
an ad-hoc peak detection. All these ideas are illuminating, and combining them with our work would
be helpful. Moreover, our own methods can still be improved. For example, in the current
implementation, the width of the sliding window is fixed at 25 frames. An adaptive sliding window
width may perform better.